Query Processing



SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

Neural Information Processing Systems

Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models strongly affect query complexity and performance. While this has long been known in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy, a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context learning (ICL) approaches for a set of representative closed- and open-source LLMs.
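To make the multi-model setup concrete, here is a minimal sketch of what one benchmark entry might look like, pairing a natural-language question with queries in two of the four target languages. The question, table names, and graph labels are illustrative assumptions, not taken from the actual SM3 schema.

```python
# Hypothetical NLQ/query pair; schema names (patient, condition, Patient,
# HAS_CONDITION) are assumptions for illustration only.
pair = {
    "nlq": "How many patients have been diagnosed with hypertension?",
    "sql": (
        "SELECT COUNT(DISTINCT p.id) FROM patient p "
        "JOIN condition c ON c.patient_id = p.id "
        "WHERE c.description = 'Hypertension'"
    ),
    "cypher": (
        "MATCH (p:Patient)-[:HAS_CONDITION]->"
        "(c:Condition {description: 'Hypertension'}) "
        "RETURN count(DISTINCT p)"
    ),
}
```

A full benchmark entry would carry one query per target language (SQL, MQL, Cypher, SPARQL), which is what allows the same question to probe how the database model shapes query complexity.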



Scalable and Efficient Non-adaptive Deterministic Group Testing

Neural Information Processing Systems

Group Testing (GT) is about learning a (hidden) subset K, of size k, of some large domain N, of size n ≫ k, using a sequence of queries. The result of a query provides some information about the intersection of the query with the unknown set K. The goal is to design efficient (polynomial time) and scalable (polylogarithmic number of queries per element in K) algorithms for constructing queries that allow decoding every hidden set K based on the results of the queries. The vast majority of previous work focused on randomized algorithms minimizing the number of queries; however, in the case of large domains N, randomization may result in a significant deviation from the expected precision of learning the set K. Other work assumed unlimited computational power (existential results) or adaptiveness of queries (the next query could be constructed taking into account the results of the previous queries) - the former approach is less practical due to its inefficiency, and the latter has several drawbacks including non-parallelizability. To avoid all of the aforementioned drawbacks, for Quantitative Group Testing (QGT), where a query result is the size of its intersection with the hidden set, we present the first efficient and scalable non-adaptive deterministic algorithms for constructing queries and decoding a hidden set K from the results of the queries - these solutions do not use any randomization, adaptiveness, or unlimited computational power.
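The QGT query model described above can be illustrated with a toy simulation. This is not the paper's deterministic construction: the queries here are random and the decoder is a brute-force search over all k-subsets, which is only feasible for tiny n, but it shows exactly what information each query reveals.

```python
from itertools import combinations
import random

def qgt_answers(queries, hidden):
    # QGT oracle: each query Q returns |Q ∩ K| for the hidden set K.
    return [len(q & hidden) for q in queries]

def brute_force_decode(n, k, queries, answers):
    # Return every k-subset consistent with the answers; a good query
    # design makes this list a singleton.
    return [
        set(c) for c in combinations(range(n), k)
        if qgt_answers(queries, set(c)) == answers
    ]

random.seed(0)
n, k = 8, 2
hidden = {2, 5}
queries = [set(random.sample(range(n), 4)) for _ in range(6)]
candidates = brute_force_decode(n, k, queries, qgt_answers(queries, hidden))
```

With too few or badly chosen queries, several candidate sets may remain consistent; the paper's contribution is a deterministic, non-adaptive query design with an efficient decoder that avoids both the ambiguity and the brute-force search.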


Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following, Wei-jin Park

Neural Information Processing Systems

This study presents an Exploratory Retrieval-Augmented Planning (ExRAP) framework, designed to tackle continual instruction following tasks of embodied agents in dynamic, non-stationary environments. The framework enhances Large Language Models' (LLMs) embodied reasoning capabilities by efficiently exploring the physical environment and establishing the environmental context memory, thereby effectively grounding the task planning process in time-varying environment contexts. In ExRAP, given multiple continual instruction following tasks, each instruction is decomposed into queries on the environmental context memory and task executions conditioned on the query results. To efficiently handle these multiple tasks that are performed continuously and simultaneously, we implement an exploration-integrated task planning scheme by incorporating information-based exploration into the LLM-based planning process. Combined with memory-augmented query evaluation, this integrated scheme not only allows for a better balance between the validity of the environmental context memory and the load of environment exploration, but also improves overall task performance. Furthermore, we devise a temporal consistency refinement scheme for query evaluation to address the inherent decay of knowledge in the memory. Through experiments with VirtualHome, ALFRED, and CARLA, our approach demonstrates robustness against a variety of embodied instruction following scenarios involving different instruction scales, instruction types, and degrees of non-stationarity, and it consistently outperforms other state-of-the-art LLM-based task planning approaches in terms of both goal success rate and execution efficiency.
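The idea of a context memory whose knowledge decays over time can be sketched in a few lines. The exponential decay form and the rate below are assumptions for illustration, not ExRAP's actual formulation: each stored observation carries its timestamp, and its confidence shrinks with elapsed time, so stale facts weigh less when a query is evaluated.

```python
import math

class ContextMemory:
    # Toy time-decaying environmental context memory (assumed decay model).
    def __init__(self, decay_rate=0.1):
        self.decay_rate = decay_rate
        self.facts = {}  # key -> (value, time_observed)

    def observe(self, key, value, t):
        self.facts[key] = (value, t)

    def query(self, key, t_now):
        # Return the stored value with a confidence that decays
        # exponentially in the time since it was observed.
        if key not in self.facts:
            return None, 0.0
        value, t_obs = self.facts[key]
        return value, math.exp(-self.decay_rate * (t_now - t_obs))

mem = ContextMemory(decay_rate=0.2)
mem.observe("fridge_open", True, t=0)
val_early, conf_early = mem.query("fridge_open", t_now=1)
val_late, conf_late = mem.query("fridge_open", t_now=10)
```

A planner could use such confidences to decide between acting on the memory and spending effort on re-exploration, which is the trade-off the abstract describes.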


Efficient Pure Exploration in Adaptive Round model

Neural Information Processing Systems

In the adaptive setting, many multi-armed bandit applications allow the learner to adaptively draw samples and adjust the sampling strategy in rounds. In many real applications, not only the query complexity but also the round complexity needs to be optimized. In this paper, we study both PAC and exact top-k arm identification problems and design efficient algorithms considering both round complexity and query complexity.
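A standard baseline that makes the round/query trade-off concrete is successive halving (a classic scheme, not this paper's algorithm): each round pulls every surviving arm a fixed number of times and discards the worse half, so the number of rounds is logarithmic in the number of arms.

```python
import random

def successive_halving(pull, n_arms, pulls_per_round=100, seed=0):
    # Each round: sample every surviving arm, keep the better half.
    # Rounds: ceil(log2 n); queries per round: pulls_per_round * survivors.
    rng = random.Random(seed)
    survivors = list(range(n_arms))
    while len(survivors) > 1:
        means = {
            a: sum(pull(a, rng) for _ in range(pulls_per_round)) / pulls_per_round
            for a in survivors
        }
        survivors.sort(key=lambda a: means[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

# Toy arms with well-separated means and small noise: arm 7 is best.
def pull(arm, rng):
    return 0.1 * arm + rng.gauss(0, 0.01)

best = successive_halving(pull, n_arms=8)
```

The paper's contribution is to go beyond such baselines and optimize round and query complexity jointly for both PAC and exact top-k identification.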


On Margin-Based Cluster Recovery with Oracle Queries

Neural Information Processing Systems

We study an active cluster recovery problem where, given a set of n points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous works, the classic SVM margin, and standard notions of stability for center-based clusterings. Under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only O(log n) queries.
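For contrast with the paper's O(log n)-query algorithms, here is the naive baseline for exact recovery with a same-cluster oracle: compare each point against one representative per discovered cluster, which recovers all clusters exactly but spends up to O(nk) queries.

```python
def recover_clusters(points, same_cluster):
    # Naive exact recovery: each cluster keeps its first point as a
    # representative; each new point is queried against representatives
    # until a match is found (or a new cluster is opened).
    clusters = []
    queries = 0
    for p in points:
        for cluster in clusters:
            queries += 1
            if same_cluster(p, cluster[0]):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters, queries

# Ground-truth labels play the role of the oracle in this toy example.
labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
oracle = lambda x, y: labels[x] == labels[y]
clusters, n_queries = recover_clusters(list(labels), oracle)
```

Margin assumptions are what let the paper's algorithms beat this baseline: with enough geometric structure, whole regions of points can be labeled per query instead of one point at a time.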


Appendices: A. Basic Facts about Gaussian Distributions; B. An Improved Analysis of NA-Hutch++; C. Lower Bounds; C.1 Case 1: Lower Bound for Small ε

Neural Information Processing Systems

We let G denote an n × n random Gaussian matrix with i.i.d. entries. Then Rg has the same distribution as g. Recall that NA-Hutch++ splits its matrix-vector queries between computing an O(1)-approximate rank-k approximation Ã and performing Hutchinson's estimate on the residual matrix A − Ã. The key to the improved query complexity of NA-Hutch++ lies in the analysis of the size of the random Gaussian sketching matrices S, R in Algorithm 1 needed to obtain an O(1)-approximate rank-k approximation Ã in the Frobenius norm. To get the desired rank-k approximation, we need S and R to satisfy two properties: 1) subspace embedding as in Lemma 3.3 and 2) approximate matrix product for orthogonal subspaces as in Lemma 3.4. Specifically, we show in Lemma 3.4 that choosing S and R to be of size O(k + log(1/δ)) suffices to get the second property with probability 1 − δ.
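The Hutchinson step mentioned above is simple to state in code. This sketch shows only the plain Hutchinson trace estimator with Rademacher query vectors (the building block NA-Hutch++ applies to the residual A − Ã), in pure Python for a small dense matrix.

```python
import random

def matvec(A, x):
    # Dense matrix-vector product; each entry is a row-vector dot product.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def hutchinson_trace(A, num_queries, rng):
    # tr(A) ≈ (1/m) * sum over m queries of g^T A g, with g Rademacher;
    # each query costs one matrix-vector product.
    n = len(A)
    total = 0.0
    for _ in range(num_queries):
        g = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Ag = matvec(A, g)
        total += sum(g_i * Ag_i for g_i, Ag_i in zip(g, Ag))
    return total / num_queries

A = [[1.0, 0.1, 0.0],
     [0.1, 2.0, 0.1],
     [0.0, 0.1, 3.0]]
estimate = hutchinson_trace(A, num_queries=5000, rng=random.Random(0))
```

The estimator's variance scales with the squared off-diagonal mass of A, which is exactly why NA-Hutch++ first deflates a rank-k approximation and runs Hutchinson only on the small residual.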


Contextual Active Model Selection, Rick L. Stevens

Neural Information Processing Systems

While training models and labeling data are resource-intensive, a wealth of pretrained models and unlabeled data exists. To effectively utilize these resources, we present an approach to actively select pre-trained models while minimizing labeling costs. We frame this as an online contextual active model selection problem: At each round, the learner receives an unlabeled data point as a context. The objective is to adaptively select the best model to make a prediction while limiting label requests. To tackle this problem, we propose CAMS, a contextual active model selection algorithm that relies on two novel components: (1) a contextual model selection mechanism, which leverages context information to make informed decisions about which model is likely to perform best for a given context, and (2) an active query component, which strategically chooses when to request labels for data points, minimizing the overall labeling cost. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Furthermore, we demonstrate the effectiveness of our algorithm on a diverse collection of benchmark classification tasks. Notably, CAMS requires substantially less labeling effort (less than 10%) compared to existing methods on CIFAR10 and DRIFT benchmarks, while achieving similar or better accuracy.
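The two ingredients of the abstract, a selection mechanism over pretrained models and a rule for when to pay for a label, can be sketched with a simplified non-contextual stand-in (this is an assumed simplification, not CAMS itself): exponential weights over the models, updated only on rounds where a coin flip decides to query the true label.

```python
import random

def active_model_selection(models, stream, eta=0.5, query_prob=0.5, seed=0):
    # Exponential-weights model selection with randomized label queries.
    # CAMS additionally conditions the selection on the context x.
    rng = random.Random(seed)
    weights = [1.0] * len(models)
    labels_queried = 0
    for x, y in stream:
        if rng.random() < query_prob:  # only sometimes pay for a label
            labels_queried += 1
            for i, model in enumerate(models):
                loss = 0.0 if model(x) == y else 1.0
                weights[i] *= (1 - eta) ** loss  # multiplicative update
    return weights, labels_queried

noise = random.Random(1)
stream = [(x, x % 2) for x in range(400)]
good = lambda x: x % 2               # always correct
bad = lambda x: noise.randrange(2)   # coin-flip predictor
weights, n_labels = active_model_selection([good, bad], stream)
```

After the stream, the always-correct model keeps weight 1.0 while the coin-flip model's weight collapses, and only a fraction of the 400 labels were ever requested, which is the query-efficiency the abstract quantifies for CAMS.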